It is well known that an excessive alcohol consumption can cause the increased risk of health problems. However, alcoholic beverages are highly available in any european country and the majority of people can not imagine a party without an alcohol. A lot of people believe that moderate alcohol intake can actually help them to release stress and feel more relaxed. But others avoid alcohol consumption not to harm their physical and psychological health. This project aims to reveal the truth - is there a relation between the frequency of alcohol consumption and current depressive symptoms in Europe?
The data is generated by the statistical office of the European Union - Eurostat. Please load the dataset of alcohol consumption in csv format.
Please also load the dataset of depressive symptoms in csv format. The general coverage of the survey is the population aged 15 or over living inside European Union. The surveys were conducted in 2014 and 2019 separately for males and females with different educational levels. The data is given in percentage of population per country.
If OBS_FLAG is set to "u" it means that the data has low reliability. It was decided to exclude such data from the statistical tests.
import numpy as np
import yaml
import pandas as pd
def get_config():
with open("config.yaml", 'r') as stream:
return yaml.safe_load(stream)
config = get_config()
def read_csv(yaml_key):
filepath = (config[yaml_key])
if not filepath.endswith('.csv'):
raise Exception
df = pd.read_csv(filepath)
return df
alcohol_df = read_csv('alcohol_consumption')
depression_df = read_csv('depressive_symptoms')
alcohol_df.head()
| DATAFLOW | LAST UPDATE | freq | unit | frequenc | isced11 | sex | age | geo | TIME_PERIOD | OBS_VALUE | OBS_FLAG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ESTAT:HLTH_EHIS_AL1E(1.0) | 01/04/22 23:00:00 | A | PC | DAY | ED0-2 | F | TOTAL | AT | 2014 | 2.0 | NaN |
| 1 | ESTAT:HLTH_EHIS_AL1E(1.0) | 01/04/22 23:00:00 | A | PC | DAY | ED0-2 | F | TOTAL | AT | 2019 | 2.3 | NaN |
| 2 | ESTAT:HLTH_EHIS_AL1E(1.0) | 01/04/22 23:00:00 | A | PC | DAY | ED0-2 | F | TOTAL | BE | 2014 | 7.5 | u |
| 3 | ESTAT:HLTH_EHIS_AL1E(1.0) | 01/04/22 23:00:00 | A | PC | DAY | ED0-2 | F | TOTAL | BE | 2019 | 4.0 | NaN |
| 4 | ESTAT:HLTH_EHIS_AL1E(1.0) | 01/04/22 23:00:00 | A | PC | DAY | ED0-2 | F | TOTAL | BG | 2014 | 1.3 | NaN |
depression_df.head()
| DATAFLOW | LAST UPDATE | freq | unit | isced11 | hlth_pb | sex | age | geo | TIME_PERIOD | OBS_VALUE | OBS_FLAG | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ESTAT:HLTH_EHIS_MH1E(1.0) | 29/10/21 23:00:00 | A | PC | ED0-2 | DPR | F | TOTAL | AT | 2014 | 7.5 | NaN |
| 1 | ESTAT:HLTH_EHIS_MH1E(1.0) | 29/10/21 23:00:00 | A | PC | ED0-2 | DPR | F | TOTAL | AT | 2019 | 10.9 | NaN |
| 2 | ESTAT:HLTH_EHIS_MH1E(1.0) | 29/10/21 23:00:00 | A | PC | ED0-2 | DPR | F | TOTAL | BE | 2019 | 13.0 | NaN |
| 3 | ESTAT:HLTH_EHIS_MH1E(1.0) | 29/10/21 23:00:00 | A | PC | ED0-2 | DPR | F | TOTAL | BG | 2014 | 14.2 | NaN |
| 4 | ESTAT:HLTH_EHIS_MH1E(1.0) | 29/10/21 23:00:00 | A | PC | ED0-2 | DPR | F | TOTAL | BG | 2019 | 10.7 | NaN |
If the column contains one unique value within the whole data this column is considered redundant.
def checkTheAmountOfUniqueValues(df, columnNames):
res = []
for columnName in columnNames:
res.append(len(df[columnName].unique()))
return res
# check that values are redundant (only one unique for all data)
redundant_column_names = ["DATAFLOW", "LAST UPDATE", "freq", "unit"]
redundant_alcohol = checkTheAmountOfUniqueValues(alcohol_df, redundant_column_names)
redundant_depression = checkTheAmountOfUniqueValues(depression_df, redundant_column_names)
print(f'Amount of unique values for {redundant_column_names} in alcohol consumption table: {redundant_alcohol}')
print(f'Amount of unique values for {redundant_column_names} in depressive symptoms table: {redundant_depression}')
# get rid of redundant columns
alcohol_df.drop(columns =redundant_column_names, inplace = True)
depression_df.drop(columns =redundant_column_names, inplace = True)
alcohol_df.head()
Amount of unique values for ['DATAFLOW', 'LAST UPDATE', 'freq', 'unit'] in alcohol consumption table: [1, 1, 1, 1] Amount of unique values for ['DATAFLOW', 'LAST UPDATE', 'freq', 'unit'] in depressive symptoms table: [1, 1, 1, 1]
| frequenc | isced11 | sex | age | geo | TIME_PERIOD | OBS_VALUE | OBS_FLAG | |
|---|---|---|---|---|---|---|---|---|
| 0 | DAY | ED0-2 | F | TOTAL | AT | 2014 | 2.0 | NaN |
| 1 | DAY | ED0-2 | F | TOTAL | AT | 2019 | 2.3 | NaN |
| 2 | DAY | ED0-2 | F | TOTAL | BE | 2014 | 7.5 | u |
| 3 | DAY | ED0-2 | F | TOTAL | BE | 2019 | 4.0 | NaN |
| 4 | DAY | ED0-2 | F | TOTAL | BG | 2014 | 1.3 | NaN |
Check if there are Nan values and remove them from the data.
# drop nan values
check_nan_values_alcohol = alcohol_df['OBS_VALUE'].isna().values.any()
check_nan_values_depr_sympts = depression_df['OBS_VALUE'].isna().values.any()
print(f'There are NaN values in alcohol consumption table - {check_nan_values_alcohol} and in depressive symptoms table - {check_nan_values_depr_sympts}')
alcohol_df.dropna(subset=['OBS_VALUE'], inplace=True)
depression_df.dropna(subset=['OBS_VALUE'], inplace=True)
There are NaN values in alcohol consumption table - True and in depressive symptoms table - True
Check what data we are working with - what categories for location, age, educational level and time period we have. Compare it within two tables and prepare for being merged.
def filter_rows_by_values(df, columnName, values):
return df[df[columnName].isin(values)]
locations_in_total_al = alcohol_df["geo"].unique()
locations_in_total_dep = depression_df["geo"].unique()
print(f'Possible locations: {locations_in_total_al}, len: {len(locations_in_total_al)}')
print(f'Possible locations are the same?: {np.array_equal(locations_in_total_al, locations_in_total_dep)}')
# let's use only the statistics for separate countries
countries = [country for country in locations_in_total_al if len(country ) == 2]
alcohol_df = filter_rows_by_values(alcohol_df, "geo", countries)
depression_df = filter_rows_by_values(depression_df, "geo", countries)
# let's check possible age groups
age_in_total_al = alcohol_df["age"].unique()
age_in_total_dep = depression_df["age"].unique()
print(f'Possible age groups for alcohol consumption table: {age_in_total_al}, len: {len(age_in_total_al)}')
print(f'Possible age groups for depressive symptoms table: {age_in_total_dep}, len: {len(age_in_total_dep)}')
# 'Y25-64' is missing in the depressive symptoms table
alcohol_df.drop(alcohol_df[alcohol_df['age'] == 'Y25-64'].index, inplace=True)
# let's check possible education levels
edu_in_total_al = alcohol_df["isced11"].unique()
edu_in_total_dep = depression_df["isced11"].unique()
print(f'Possible education levels are the same?: {np.array_equal(edu_in_total_al, edu_in_total_dep)}')
print(f'Possible education levels: {edu_in_total_al}, len: {len(edu_in_total_al)}')
time_in_total_al = alcohol_df["TIME_PERIOD"].unique()
time_in_total_dep = depression_df["TIME_PERIOD"].unique()
print(f'Possible time periods are the same?: {np.array_equal(time_in_total_al, time_in_total_dep)}')
print(f'Possible time periods: {time_in_total_al}, len: {len(time_in_total_al)}')
Possible locations: ['AT' 'BE' 'BG' 'CY' 'CZ' 'DE' 'DK' 'EE' 'EL' 'ES' 'EU27_2020' 'EU28' 'FI' 'FR' 'HR' 'HU' 'IE' 'IS' 'IT' 'LT' 'LU' 'LV' 'MT' 'NL' 'NO' 'PL' 'PT' 'RO' 'RS' 'SE' 'SI' 'SK' 'TR' 'UK'], len: 34 Possible locations are the same?: True Possible age groups for alcohol consumption table: ['TOTAL' 'Y15-19' 'Y15-24' 'Y15-29' 'Y15-64' 'Y18-24' 'Y18-44' 'Y18-64' 'Y20-24' 'Y25-29' 'Y25-34' 'Y25-64' 'Y35-44' 'Y45-54' 'Y45-64' 'Y55-64' 'Y65-74' 'Y_GE18' 'Y_GE65' 'Y_GE75'], len: 20 Possible age groups for depressive symptoms table: ['TOTAL' 'Y15-19' 'Y15-24' 'Y15-29' 'Y15-64' 'Y18-24' 'Y18-44' 'Y18-64' 'Y20-24' 'Y25-29' 'Y25-34' 'Y35-44' 'Y45-54' 'Y45-64' 'Y55-64' 'Y65-74' 'Y_GE18' 'Y_GE65' 'Y_GE75'], len: 19 Possible education levels are the same?: True Possible education levels: ['ED0-2' 'ED3_4' 'ED5-8' 'TOTAL'], len: 4 Possible time periods are the same?: True Possible time periods: [2014 2019], len: 2
As was mentioned above, there is the flag value - 'u' - which means that the data has low reliability. It was decided to eliminate such values from the data to get the most accurate results conducting statistical tests.
# get rid of the data with low reliability
low_reliability_values_alc = alcohol_df[alcohol_df['OBS_FLAG'] == 'u']
low_reliability_values_dep = depression_df[depression_df['OBS_FLAG'] == 'u']
print(f'There are {len(low_reliability_values_alc)} low reliability values out of {len(alcohol_df)}')
print(f'There are {len(low_reliability_values_dep)} low reliability values out of {len(depression_df)}')
alcohol_df.drop(alcohol_df[alcohol_df['OBS_FLAG'] == 'u'].index, inplace=True)
depression_df.drop(depression_df[depression_df['OBS_FLAG'] == 'u'].index, inplace=True)
# we don't need the flags anymore
alcohol_df.drop("OBS_FLAG", axis='columns', inplace=True)
depression_df.drop("OBS_FLAG", axis='columns', inplace=True)
There are 7131 low reliability values out of 89230 There are 1692 low reliability values out of 31344
Combine data to create a dataframe with values of interest for the statistical analysis. Check the categories for the health problems (depressive symptoms), frequency of alcohol consumption and sex. To answer the research question we don't need to separate people by age, sex or educational level, so we fetch the data with 'TOTAL' values for these categories. There is no total value for the health problems or alcohol consumption frequency, therefore we use most common depressive symptoms and the daily alcohol consumption for these categories respectively.
statistics = alcohol_df.merge(depression_df, how='inner', on=['geo', 'sex', 'age', 'TIME_PERIOD', 'isced11'], suffixes=('_alc', '_dep'))
health_problems = depression_df["hlth_pb"].unique()
print(f'Possible health problems: {health_problems}, len: {len(health_problems)}')
frequencies_al = alcohol_df["frequenc"].unique()
print(f'Possible frequencies: {frequencies_al}, len: {len(frequencies_al)}')
frequencies_al = alcohol_df["sex"].unique()
print(f'Possible sex: {frequencies_al}, len: {len(frequencies_al)}')
def getTable(df, dep_symptoms, age, sex, time_period, frequency, edu_level):
return df[(df['hlth_pb'] == dep_symptoms) & (df['age'] == age) & (df['sex'] == sex) & (df['TIME_PERIOD'] == time_period) & (df['frequenc'] == frequency) & (df['isced11'] == edu_level)]
res_2014 = getTable(statistics, 'DPR', 'TOTAL', 'T', 2014, 'DAY', 'TOTAL')
res_2014.head()
Possible health problems: ['DPR' 'DPR_MJR' 'DPR_OTH'], len: 3 Possible frequencies: ['DAY' 'LT1M' 'MTH' 'NM12' 'NVR' 'NVR_NM12' 'WEEK'], len: 7 Possible sex: ['F' 'M' 'T'], len: 3
| frequenc | isced11 | sex | age | geo | TIME_PERIOD | OBS_VALUE_alc | hlth_pb | OBS_VALUE_dep | |
|---|---|---|---|---|---|---|---|---|---|
| 179613 | DAY | TOTAL | T | TOTAL | AT | 2014 | 6.2 | DPR | 5.0 |
| 179676 | DAY | TOTAL | T | TOTAL | BG | 2014 | 8.9 | DPR | 7.9 |
| 179718 | DAY | TOTAL | T | TOTAL | CY | 2014 | 4.2 | DPR | 4.6 |
| 179760 | DAY | TOTAL | T | TOTAL | CZ | 2014 | 9.5 | DPR | 3.2 |
| 179802 | DAY | TOTAL | T | TOTAL | DE | 2014 | 9.3 | DPR | 8.5 |
Here we plot the data for both 2014 and 2019 to represent the correlation between the daily alcohol consumption and the development of depressive symptoms for the countries inside EU. The data is given in percentage of population per certain country. The stacked bar plot was chosen for data illustration to show the numerical values across two datasets at the same time in a comprehensive way.
import matplotlib.pyplot as plt
def showPlot(df, year):
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
countries = df['geo']
y1, y2 = df['OBS_VALUE_alc'], df['OBS_VALUE_dep']
ax.bar(countries, y1, color = 'b', alpha=0.8)
ax.bar(countries, y2, bottom=y1, color = 'r', alpha=0.8)
plt.xlabel("Countries")
plt.ylabel("Percentage of population")
plt.legend(["Alcohol consumption", "Depressive symptoms"])
plt.title(f"Correlation between alcohol consumption on daily basis and the depressive symptoms development in {year} inside EU")
plt.show()
showPlot(res_2014, 2014)
res_2019 = getTable(statistics, 'DPR', 'TOTAL', 'T', 2019, 'DAY', 'TOTAL')
showPlot(res_2019, 2019)
Check for the data normality using Shapiro–Wilk test. Use a Levene’s & Bartlett’s Test of Equality (Homogeneity) of Variance to test equal variance. Conduct the paired samples t-test for the alcohol consumption and depressive symptoms values for both years.
from scipy import stats
from scipy.stats import levene
def conductTests(df):
y1, y2 = df['OBS_VALUE_alc'], df['OBS_VALUE_dep']
# (for 2014) the p-value is 0.002773334039375186 which is less than the alpha(0.05) - the sample does not come from a normal distribution.
print(stats.shapiro(y1))
# (for 2014) the p-value is 0.15160074830055237 which is NOT less than the alpha(0.05) - the sample comes from a normal distribution.
print(stats.shapiro(y2))
# (for 2014) p-value (0.00088) is less than 0.05, so we can't conduct the two-sample t-test because the variance differs
print("Levene test(alcohol_consumption~depressive_symptoms):\n", levene(y1, y2))
# let's conduct paired samples t-test: (for 2014) p-value (0.76) is not less than 0.05, so there is no significant interaction effect between the alcohol consumption and the depressive symptoms
print("Paired samples t-test results:\n", stats.ttest_rel(y1, y2))
print('\n2014 YEAR:')
conductTests(res_2014)
print('\n2019 YEAR:')
conductTests(res_2019)
2014 YEAR: ShapiroResult(statistic=0.8726333975791931, pvalue=0.002773334039375186) ShapiroResult(statistic=0.9453981518745422, pvalue=0.15160074830055237) Levene test(alcohol_consumption~depressive_symptoms): LeveneResult(statistic=12.400217564490896, pvalue=0.0008813484514476608) Paired samples t-test results: TtestResult(statistic=0.3047202215976416, pvalue=0.7629157459154288, df=27) 2019 YEAR: ShapiroResult(statistic=0.9108201861381531, pvalue=0.017963483929634094) ShapiroResult(statistic=0.9626134037971497, pvalue=0.38056883215904236) Levene test(alcohol_consumption~depressive_symptoms): LeveneResult(statistic=9.943354041321683, pvalue=0.0025946145492243135) Paired samples t-test results: TtestResult(statistic=-0.29872429438994647, pvalue=0.7673562674649895, df=28)
Let's test if the educational level has significant effect on the development of depressive symptoms (the data only for 2014 was used).
# one-way ANOVA: percentage of people having depressive symptoms - dependent vars, categories of education level - categorical vars
# get the table with ALL possible education levels
data_anova1 = statistics[(statistics['hlth_pb'] == 'DPR') & (statistics['age'] == 'TOTAL') & (statistics['sex'] == 'T') & (statistics['TIME_PERIOD'] == 2014) & (statistics['frequenc'] == 'DAY')]
# get rid of the 'TOTAL' education level (let's work only with three categories: 'ED0-2' 'ED3_4' 'ED5-8')
data_anova1 = data_anova1.drop(data_anova1[data_anova1['isced11'] == 'TOTAL'].index)
groups_frame_anova1 = pd.DataFrame({"Dep_symptoms":data_anova1['OBS_VALUE_dep'],"Edu_level":data_anova1['isced11']})
groups_by_category_of_edu_level = groups_frame_anova1.groupby("Edu_level")
edo2_group = groups_by_category_of_edu_level.get_group('ED0-2')['Dep_symptoms']
ed34_group = groups_by_category_of_edu_level.get_group('ED3_4')['Dep_symptoms']
ed58_group = groups_by_category_of_edu_level.get_group('ED5-8')['Dep_symptoms']
# P(1.05e-13) << 0.05. It means that the educational level has significant effect on the development of depressive symptoms
print("one-way ANOVA:\n", stats.f_oneway(edo2_group, ed34_group, ed58_group), "\n")
one-way ANOVA: F_onewayResult(statistic=44.197417883476206, pvalue=1.0548398688765529e-13)
Let's test if the educational level and the frequency of alcohol consumption cause an effect on the percentage of population consuming alcohol (the data only for 2014 was used).
import statsmodels.api as sm
from statsmodels.formula.api import ols
# two-way ANOVA: percentage of people consuming alcohol - dependent vars, categories of consumption frequency AND educational levels - categorical vars
# get the table with ALL possible consumption frequencies and educational levels
data = statistics[(statistics['hlth_pb'] == 'DPR') & (statistics['age'] == 'TOTAL') & (statistics['sex'] == 'T') & (statistics['TIME_PERIOD'] == 2014)]
# consumption frequencies: 'DAY' 'LT1M' 'MTH' 'NM12' 'NVR' 'NVR_NM12' 'WEEK'
# get rid of the 'TOTAL' education level (let's work only with three categories: 'ED0-2' 'ED3_4' 'ED5-8')
data = data.drop(data[data['isced11'] == 'TOTAL'].index)
groups_frame = pd.DataFrame({"Alc_consumption":data['OBS_VALUE_alc'],
"Edu_level":data['isced11'],
'Frequency': data['frequenc']})
model = ols('Alc_consumption ~ C(Edu_level) + C(Frequency) + C(Edu_level):C(Frequency)', data=groups_frame).fit()
# P for C(Edu_level):C(Frequency) << 0.05 - there is significant interaction effect between educational level and the frequency of alcohol consumption
# Both factors do have statistically significant effect on the amount of people consuming alcohol (their p-values << 0.05)
print("two-way ANOVA:\n", sm.stats.anova_lm(model, type=2), "\n")
two-way ANOVA:
df sum_sq mean_sq F \
C(Edu_level) 2.0 1050.606565 525.303282 5.509303
C(Frequency) 6.0 35042.773061 5840.462177 61.253900
C(Edu_level):C(Frequency) 12.0 14844.726531 1237.060544 12.974107
Residual 567.0 54062.550714 95.348414 NaN
PR(>F)
C(Edu_level) 4.268662e-03
C(Frequency) 1.910906e-58
C(Edu_level):C(Frequency) 1.069772e-23
Residual NaN
Statistical analysis was conducted only for the small sample of data (with 'TOTAL' values for age, sex, educational level). The interactive plot lets the user see the correlation between alcohol consumption and depressive symptoms development in european countries for different sex, age, educational level and alcohol consumption frequency.
Legends:
from bokeh.models import FactorRange
from bokeh.plotting import figure
from bokeh.io import output_notebook
from bokeh.plotting import ColumnDataSource
from bokeh.palettes import Spectral6
from bokeh.transform import factor_cmap
from bokeh.resources import INLINE
output_notebook(INLINE)
def buildPlot(source):
countries = source['geo'].values
values = ['AC', 'DS']
data = {'geo' : countries,
'AC' : source['OBS_VALUE_alc'].values,
'DS' : source['OBS_VALUE_dep'].values
}
x = [ (country, value) for country in countries for value in values ]
counts = sum(zip(data['AC'], data['DS']), ())
legends = [v[1] for v in x ]
source_column = ColumnDataSource(data=dict(x=x, counts=counts, legends = legends))
p = figure(x_range=FactorRange(*x), title="Correlation between alcohol consumption and depressive symptoms development", plot_width=1100, plot_height=600)
p.vbar(x='x', top='counts', width=1, source=source_column, line_color="white", legend_field="legends",
fill_color=factor_cmap('x', palette=Spectral6, factors=values, start=1, end=2))
p.y_range.start = 0
p.x_range.group_padding = 3.0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1
p.xgrid.grid_line_color = None
p.yaxis.axis_label = "Population percentage"
return p
def GetPlot(symptoms, age, sex, year, freq, edu_level):
source = getTable(statistics, symptoms, age, sex, year, freq, edu_level)
return buildPlot(source)
import panel as pn
from panel import widgets
pn.extension()
frequencies = ['DAY', 'LT1M', 'MTH', 'NM12', 'NVR', 'NVR_NM12', 'WEEK']
select_freq = widgets.Select(
name='Select alcohol consumption frequency',
options=frequencies,
size=1)
time_period = [2014, 2019]
select_time = widgets.Select(
name='Select time period',
options=time_period,
size=1)
depr_symptoms = ['DPR', 'DPR_MJR', 'DPR_OTH']
select_symptoms = widgets.Select(
name='Select depressive symptoms',
options=depr_symptoms,
size=1)
sex = ['F', 'M', 'T']
select_sex = widgets.Select(
name='Select sex',
options=sex,
size=1)
age = list(statistics['age'].unique())
select_age = widgets.Select(
name='Select age',
options=age,
size=1)
edu_level = list(statistics['isced11'].unique())
select_edu_level = widgets.Select(
name='Select educational level',
options=edu_level,
size=1)
stacked_plot = pn.bind(GetPlot, symptoms=select_symptoms, age=select_age, sex=select_sex, year=select_time, freq=select_freq, edu_level=select_edu_level)
pn.Column(pn.Row(select_freq, select_time, select_symptoms), pn.Row(select_sex, select_age, select_edu_level), stacked_plot)
The interactive plot showed that three countries which have the largest percentage of population drinking alcohol on daily basis are PT (Portugal, almost 40%!), ES (Spain) and IT (Italy) in 2014. Other three countries which population never consumes alcohol are CY (Cyprus, about 45%), HR(Croatia) and IT (Italy) in 2014. Plot doesn't show any noticeable correlation between the development of depressive symptoms and alcohol consumption (as well as the t-test). It also proves that there is the correlation between the educational level and the development of depressive symptoms (the one-way ANOVA test) - the higher the education level the smaller amount of population struggles with depressive symptoms. It also can be seen that most people consume alcohol several times per month or monthly.